You should have received an email about signing up for the GVPT experimental lab
Check the exam 1 module for a list of things to know for exam 1. Be ready to ask about them on Monday.
We’ll start introducing material from chapter 6. This will not be on the upcoming midterm, but we’ll revisit it after the break.
We’ll talk about some math, but ultimately we don’t need a lot beyond square roots to understand what’s going on. The more important part is the intuition.
Note: the general principles here apply to analyses of non-survey data as well, but we’ll talk in terms of surveys because that tends to be a bit more familiar for people (or at least gov majors)
Remember that:
We want to talk about the population: the beliefs and behaviors of all eligible voters, or all countries, or all adults. We’ll call characteristics of this group (mean, standard deviation, median etc) population parameters
But we typically only have data on a sample from the population. We’ll call characteristics of this group sample statistics
We don’t expect a sample mean to exactly equal a population mean. Instead, we expect it to have some random error.
Sample Statistic = Population Parameter \(\pm\) Random Error
We don’t know the actual size of the error for a given statistic, but we can estimate the probability of an error greater than some value.
The general intuition:
more data = more certainty. Estimates from a really big representative sample are going to be more likely to approximate the true population mean.
less variance = more certainty. If people in the target population are pretty much the same, then it is more likely that a given sample will resemble the target population.
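A quick simulation can make both intuitions concrete. The sketch below is only an illustration: the normal population with mean 50, the sample sizes, and the helper name `spread_of_sample_means` are all assumptions chosen for the example, not taken from any real survey.

```r
set.seed(1)  # arbitrary seed so the sketch is reproducible

# Hypothetical helper: draw `reps` samples of size `n` from a normal
# population with standard deviation `pop_sd`, then report how much the
# sample means vary from one sample to the next.
spread_of_sample_means <- function(n, pop_sd, reps = 2000) {
  sample_means <- replicate(reps, mean(rnorm(n, mean = 50, sd = pop_sd)))
  sd(sample_means)
}

# more data = more certainty (same population, larger sample)
small_n <- spread_of_sample_means(n = 25,  pop_sd = 10)
large_n <- spread_of_sample_means(n = 400, pop_sd = 10)

# less variance = more certainty (same sample size, more uniform population)
high_var <- spread_of_sample_means(n = 100, pop_sd = 20)
low_var  <- spread_of_sample_means(n = 100, pop_sd = 5)

c(small_n = small_n, large_n = large_n, high_var = high_var, low_var = low_var)
```

In both comparisons the second number should come out smaller: the sample means bounce around less when the sample is bigger or when the population is more uniform.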
from https://criticalissues.umd.edu/feature/new-study-change-us-public-attitudes-towards-jews-and-muslims-2022-2024
The margin of error is closely tied to what statisticians call a “confidence interval”: it is the distance from the estimate to either end of that interval.
A confidence interval is a range that we are reasonably certain contains the actual population value.
Specifically, the margin generally reported in a survey is a 95% confidence interval when there’s a 50/50 split in a dichotomous outcome.
If I estimate that 50% of respondents are Democrats with a margin of error of +/- 3% then I’m saying I’m 95% certain that the range 47-53% contains the true population % of Dems.
95% confidence means: “if I repeated this survey an infinite number of times and recalculated the CI for each one, 95% of my calculated confidence intervals would contain the actual population value.”
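The "repeat the survey many times" idea can be checked by simulation. This is a rough sketch under stated assumptions: a true population share of 0.5, 1,000 respondents per survey, and 5,000 repetitions standing in for "infinite". The interval itself uses the standard error and 1.96 multiplier that these notes develop further on.

```r
set.seed(2)  # arbitrary seed so the sketch is reproducible

true_p <- 0.5   # assumed population share of Democrats
n      <- 1000  # assumed number of respondents per survey
reps   <- 5000  # stand-in for "an infinite number of times"

# For each simulated survey: estimate the proportion, build a 95% CI,
# and record whether the interval contains the true value.
covers <- replicate(reps, {
  p_hat <- mean(rbinom(n, size = 1, prob = true_p))
  se    <- sqrt(p_hat * (1 - p_hat) / n)
  lower <- p_hat - 1.96 * se
  upper <- p_hat + 1.96 * se
  lower <= true_p && true_p <= upper
})

mean(covers)  # share of intervals containing the true value
```

The result should land close to 0.95, though not exactly on it, because the simulation is finite and the normal curve is only an approximation here.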
Central Limit Theorem
As the sample size grows, the sampling distribution of the sample mean (its distribution across repeated samples) will converge toward a normal distribution centered on the population mean.
Think about flipping a coin:
the probability of landing on heads is 50% (this is our population parameter)
We’ll run an “experiment” where we flip a coin 50 times and calculate the proportion of heads.
This is hardly surprising!
the proportion of heads is 0.38.
The error is 0.38 - 0.5 = -0.12
I get closer to the correct answer when I increase the sample size, but I still will have some amount of error:
the proportion of heads is 0.51.
The error is 0.51 - 0.5 = 0.01
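A single run of this experiment might look like the sketch below. The helper name `prop_heads` is hypothetical, and the larger sample size of 5,000 is an assumption, since the notes do not say how many flips were used. The results depend on the random draw, so rerunning this will not reproduce 0.38 or 0.51 exactly.

```r
set.seed(3)  # arbitrary seed so the sketch is reproducible

# Hypothetical helper: flip a fair coin `n` times and return the
# proportion of heads.
prop_heads <- function(n) {
  flips <- sample(c("heads", "tails"), size = n, replace = TRUE)
  mean(flips == "heads")
}

small <- prop_heads(50)
large <- prop_heads(5000)  # assumed larger sample size

# error = sample statistic - population parameter
c(small_error = small - 0.5, large_error = large - 0.5)
```

The larger sample will usually, but not always, have the smaller error in any one run.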
Repeating this experiment 5,000 times will give us a sense of the sampling distribution for this sample statistic:
# 5000 "samples" of the coin-flipping experiment
# (each column of `trials` holds one experiment of 50 flips)
trials <- replicate(5000, sample(c("heads", "tails"), size = 50, replace = TRUE) == "heads")
# get the proportion of heads in each trial. Expected value = .5
mean_heads <- colMeans(trials)
# draw a histogram
hist(mean_heads, main = 'proportion of heads in 50 coin flips', breaks = 30)
Coin flips technically follow a binomial distribution, but our sampling distribution looks a lot like a normal “bell curve” centered on .5
In fact, for a sufficiently large sample size, the sampling distribution of the sample mean will end up looking like this regardless of the shape of the actual population distribution.
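To see this with a population that looks nothing like a bell curve, the sketch below samples from a strongly right-skewed distribution. The exponential population with mean 1 and the sample size of 100 are assumptions made for the illustration.

```r
set.seed(4)  # arbitrary seed so the sketch is reproducible

# A strongly right-skewed "population": exponential with mean 1
# (an assumption for illustration).
# Take 5000 samples of size 100 and keep the mean of each one.
sample_means <- replicate(5000, mean(rexp(100, rate = 1)))

hist(sample_means, breaks = 30,
     main = "means of 5,000 samples of size 100 from a skewed population")

mean(sample_means)  # close to the population mean of 1
sd(sample_means)    # close to 1 / sqrt(100) = 0.1
```

Even though individual draws pile up near zero with a long right tail, the histogram of sample means should look roughly symmetric around 1.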
We know a lot about bell curves. If we have mean \(\mu\) and a standard deviation \(\sigma\), we can easily find how much data we expect to fall anywhere within this shape using the probability density function.
\[f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left( -\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^{\!2}\,\right)\]
(Maybe not the kind of thing you want to calculate in your head, but pretty easy for a computer)
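As a sketch of how a computer handles this, the formula can be typed in directly and compared with R's built-in `dnorm()`. The helper name `normal_pdf` and the values \(\mu = 0.5\) and \(\sigma = 0.07\) are assumptions picked for illustration.

```r
# Hypothetical helper implementing the density formula directly
normal_pdf <- function(x, mu, sigma) {
  1 / (sigma * sqrt(2 * pi)) * exp(-0.5 * ((x - mu) / sigma)^2)
}

# The hand-written version and R's built-in density should agree
by_hand  <- normal_pdf(0.6, mu = 0.5, sigma = 0.07)
built_in <- dnorm(0.6, mean = 0.5, sd = 0.07)
c(by_hand = by_hand, built_in = built_in)

# Share of a normal distribution within one standard deviation of its mean
# (pnorm gives the area under the curve to the left of a value)
within_one_sd <- pnorm(0.57, mean = 0.5, sd = 0.07) - pnorm(0.43, mean = 0.5, sd = 0.07)
within_one_sd  # about 0.68
```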
The mean (\(\mu\)) and standard deviation (\(\sigma\)) will depend on the data, but we can always rescale a value by subtracting the mean and then dividing by the standard deviation.
We’ll call this new number the “Z-score”:
\[ \text{Z score} = \frac{\text{Deviation from the mean}}{\text{Standard Deviation}} \]
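A small sketch of the calculation, using two made-up variables on very different scales. The helper name `z_score` and all of the numbers are assumptions for illustration.

```r
# Hypothetical helper: how many standard deviations is x from the mean?
z_score <- function(x, mu, sigma) (x - mu) / sigma

z_proportion <- z_score(0.57, mu = 0.5, sigma = 0.07)  # a proportion
z_test_score <- z_score(110,  mu = 100, sigma = 10)    # a test score

# Both values sit one standard deviation above their own mean,
# so both have a Z-score of 1.
c(z_proportion = z_proportion, z_test_score = z_test_score)
```

Once values are on the Z-score scale, the same bell curve (mean 0, standard deviation 1) can be used for any normally distributed variable.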
Central Limit Theorem
As the sample size grows, the sampling distribution of the sample mean (its distribution across repeated samples) will converge toward a normal distribution centered on the population mean.
This means that:
We don’t need to take infinite samples to make use of this! We can draw a sample and assume that it comes from a distribution with this shape:
\[f(x) = \frac{1}{\color{red}{\sigma}\sqrt{2\pi}} \exp\left( -\frac{1}{2}\left(\frac{x-\color{red}{\mu}}{\color{red}{\sigma}}\right)^{\!2}\,\right)\]
Remember: we need to know \(\mu\) and \(\sigma\)
We don’t know our population mean \(\mu\), but we do know that the mean sampling error is zero, so we can just use 0 as our expected mean error.
However, we still need to know the standard deviation \(\sigma\)
Here, we’re looking for the standard deviation of the sampling errors. We’ll call this value our standard error or SE for short.
Finding this is a challenge, but we can make some simple assumptions here to get a “worst case scenario” for a survey
Remember that more observations = more certainty. So we expect that the standard error will decrease when taking larger samples.
We actually know how much the SE shrinks as the sample size \(n\) grows:
\[ \text{Standard error (SE)} = \frac{\sigma}{\sqrt{n}} \]
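As a quick worked example, suppose (purely for illustration) that \(\sigma = 10\):

\[ n = 25:\ \text{SE} = \frac{10}{\sqrt{25}} = 2 \qquad n = 100:\ \text{SE} = \frac{10}{\sqrt{100}} = 1 \qquad n = 400:\ \text{SE} = \frac{10}{\sqrt{400}} = 0.5 \]

Because \(n\) sits under a square root, quadrupling the sample size only cuts the standard error in half.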
We don’t know \(\sigma\), but surveys are usually reporting a proportion or percentage of people giving a particular response. The standard error of a proportion actually depends on the proportion itself:
\[ \text{Standard Error of a proportion} = \frac{\sqrt{p\times(1-p)}}{\sqrt{n}} = \sqrt{\frac{p\times(1-p)}{n}} \]
\[ p = \text{the proportion} \]
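One way to see where the \(p\times(1-p)\) term comes from is to code each response as 1 ("yes") or 0 ("no"), so that the mean of the variable is \(p\). This is a sketch of the standard result for a yes/no variable, not something specific to any one survey:

\[ \sigma^2 = p\times(1-p)^2 + (1-p)\times(0-p)^2 = p\times(1-p)\times\big[(1-p)+p\big] = p\times(1-p) \]

\[ \sigma = \sqrt{p\times(1-p)} \quad\Rightarrow\quad \text{SE} = \frac{\sigma}{\sqrt{n}} = \frac{\sqrt{p\times(1-p)}}{\sqrt{n}} \]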
Note that the \(p\times(1-p)\) part of the formula (highlighted in red) is at its maximum when p = .5 (i.e. when there’s a 50/50 split on a yes/no question)
\[ \text{Standard Error of a proportion} = \sqrt{\frac{\color{red}{p\times(1-p)}}{n}} \]
So, if we know our sample size, we can calculate the worst case scenario for the standard error where \(p = .5\)
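The sketch below tabulates \(p\times(1-p)\) across a grid of proportions to show the peak at 0.5, then computes the worst-case standard error for a few sample sizes. The helper name `worst_case_se` and the sample sizes are assumptions for illustration.

```r
p <- seq(0, 1, by = 0.1)

# The p * (1 - p) term for each possible proportion
data.frame(p = p, p_times_1_minus_p = p * (1 - p))

# The proportion with the largest standard error
worst_p <- p[which.max(p * (1 - p))]
worst_p

# Hypothetical helper: worst-case standard error for a sample of size n
worst_case_se <- function(n) sqrt(0.5 * (1 - 0.5) / n)

worst_case_se(c(500, 1000, 2500))  # illustrative sample sizes
```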
We have all the ingredients we need to apply the CLT here!
Remember we want a 95% confidence level.
We already know that for this pdf, 95% of the errors will fall within approximately \(\pm 2\) standard errors of 0 (that is, a Z-score between \(-2\) and \(2\); technically the cutoff is closer to 1.95996…)
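These cutoffs can be looked up with R's built-in normal distribution functions. This is only a quick check; the variable names are made up for the example.

```r
# Z-score that leaves 2.5% in the upper tail (and 2.5% in the lower tail)
z_95 <- qnorm(0.975)
z_95  # about 1.96

# Share of a standard normal distribution within 1.96 and within 2
# standard deviations of the mean
within_196 <- pnorm(1.96) - pnorm(-1.96)
within_2   <- pnorm(2) - pnorm(-2)
c(within_196 = within_196, within_2 = within_2)  # about 0.950 and 0.954
```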
\[ \text{Standard Error of a proportion} = \sqrt{\frac{p\times(1-p)}{n}} \]
\[ \text{SE for this sample at (p=0.5)} = \sqrt{\frac{.5(1-.5)}{2,091}} \approx 0.0109 \]
So now we just need:
\[ \text{standard error} \times \text{Z-score for a 95% confidence level} = \]
\[ .0109 \times 1.96 \approx 0.0214 = 2.14\% \]
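The same arithmetic can be wrapped in a small function. This is a sketch, and `margin_of_error` is a hypothetical helper name; the extra sample sizes are included only to show how the margin changes with \(n\).

```r
# Hypothetical helper: margin of error for n respondents.
# Defaults to the worst case (p = 0.5) and a 95% confidence level.
margin_of_error <- function(n, p = 0.5, conf_level = 0.95) {
  z  <- qnorm(1 - (1 - conf_level) / 2)  # about 1.96 for 95% confidence
  se <- sqrt(p * (1 - p) / n)
  z * se
}

moe_2091 <- margin_of_error(2091)
moe_2091  # about 0.0214, i.e. 2.14 percentage points

# Illustrative sample sizes: bigger samples give smaller margins
margin_of_error(c(500, 1000, 2091, 5000))
```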
So what is the margin of error on a poll?
It’s a 95% confidence interval for a worst-case scenario where there’s a 50/50 split on a question with two response categories.
When I take a decent-sized random sample from the population:
I know the sample mean (\(\bar{x}\)) is drawn from an approximately normal distribution.
I know the mean of this sampling distribution is equal to the population mean.
I know the standard deviation of this sampling distribution (a.k.a. the standard error) is equal to \(\frac{\sigma}{\sqrt{n}}\).
\[ \text{lower 95% confidence boundary} = \bar{x} - 1.96 \text{ standard errors} \]
\[ \text{upper 95% confidence boundary} = \bar{x} + 1.96 \text{ standard errors} \]
Sometimes we’ll just round 1.96 to “2” because it’s usually good enough.
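As a worked example, take the standard error of 0.0109 calculated earlier for \(n = 2{,}091\) and suppose, for illustration, that the sample proportion is \(\bar{x} = 0.50\):

\[ \text{lower} = 0.50 - 1.96 \times 0.0109 \approx 0.479 \qquad \text{upper} = 0.50 + 1.96 \times 0.0109 \approx 0.521 \]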
The margin of error is a worst-case scenario. It’s convenient to have one number that can apply to the entire survey, but we can generally be more precise with our confidence intervals.
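To illustrate, the sketch below compares the interval for a 50/50 result with the interval for a lopsided result at the same sample size. The helper name `prop_ci` and the 10% result are assumptions for the example, and the normal approximation gets less reliable as a proportion gets very close to 0 or 1.

```r
n <- 2091  # sample size from the margin-of-error calculation

# Hypothetical helper: 95% confidence interval for an observed proportion
prop_ci <- function(p_hat, n, z = 1.96) {
  se <- sqrt(p_hat * (1 - p_hat) / n)
  c(lower = p_hat - z * se, upper = p_hat + z * se)
}

ci_even     <- prop_ci(0.50, n)  # as wide as the reported margin of error
ci_lopsided <- prop_ci(0.10, n)  # narrower, because p * (1 - p) is smaller

ci_even
ci_lopsided
```

With these inputs the 10% result gets an interval of roughly plus or minus 1.3 percentage points, compared with about 2.1 for the 50/50 result.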
We often want to estimate confidence for things that aren’t proportions, so how do we do this without knowing the population value for \(\sigma\)?